Learning and predicting time series by neural networks 
Artificial neural networks which are trained on a time series are supposed to achieve two abilities:
firstly, to predict the series many time steps ahead, and secondly, to learn the rule which has produced
the series. It is shown that prediction and learning are not necessarily related to each other. Chaotic
sequences can be learned but not predicted, while quasiperiodic sequences can be well predicted but
not learned.

Neural networks are able to learn a rule from a set of examples. This paradigm has been used to construct adaptive
algorithms, named artificial neural networks, which are trained on a set of input/output patterns generated by
an unknown function. After the training process, the network can reproduce the patterns, but it has also achieved
generalization: it has obtained some knowledge about the unknown function.

In the simplest case the unknown function is a neural network itself, the "teacher". A different neural network
with an identical architecture, the "student", is trained on a set of examples produced by the teacher. This so-called
"student/teacher" scenario has been intensively studied using models and methods of statistical physics [refs.].
Recently these methods have also been applied to learning and generation of time series [refs.].

The main result of these theoretical investigations is that, as the student network receives more information, its
weights become increasingly similar to those of the teacher network. When the number of training examples is much
larger than the number of parameters of the teacher, the student is almost identical to the teacher and the
generalization error is close to zero. In this article we show that this fundamental relation between learning and
generalization is violated when a neural network is trained on a time series. We present a class of networks with
almost perfect prediction of the series and almost zero information about the rule. The opposite case is found as
well: a network cannot predict a time series although it is almost identical to the rule generating the series.
Hence the intuitive deduction that learning a rule leads to good generalization, and that good generalization
indicates good knowledge about the rule, is violated both ways when a neural network is trained on a time series.

We find this phenomenon already for a simple perceptron, a neural network with a single layer of synaptic
weights, given by the equation $o = g(\mathbf{w} \cdot \mathbf{S})$. Here $\mathbf{w} = (w_1, w_2, \dots, w_N)$ is
the vector of synaptic weights, $\mathbf{S} = (s_{t-1}, s_{t-2}, \dots, s_{t-N})$ is the input of the network (a
window of the time series), $o$ is the output value and $N$ is the size of the network. In the following we will
study different transfer functions $g(x)$. Such a perceptron can be used as a sequence generator (teacher with
weights $\mathbf{w}^T$) as well as a network being trained on a time series (student with weights $\mathbf{w}^S$).

The sequence is generated by a teacher network with random weights, starting from random initial conditions
$(s_N, s_{N-1}, \dots, s_1)$; hence it is defined by the equation

$$s_t = g\Big(\sum_{j=1}^{N} w_j^T \, s_{t-j}\Big) \qquad (1)$$

We define a time $t_0$ in such a way that the sequence is stationary for any $t > t_0$. Here, "stationary" means that
the sequence lies on its attractor. The transient, which is of order $O(N)$, is not included in the training examples.
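The generation procedure can be sketched in a few lines: a perceptron $o = g(\mathbf{w} \cdot \mathbf{S})$ is fed with its own previous outputs and the first part of the run is discarded. This is only an illustration under stated assumptions; the function name `generate_sequence`, the Gaussian weights, the gain value, the seed and the cutoff of $10N$ steps are choices made here, not values taken from the text.

```python
import math
import random


def generate_sequence(weights, initial_window, length, g):
    """Iterate s_t = g(sum_j w_j * s_{t-j}) and return the generated values."""
    # window[0] is s_{t-1}, window[1] is s_{t-2}, ..., window[-1] is s_{t-N}
    window = list(initial_window)
    sequence = []
    for _ in range(length):
        field = sum(w * s for w, s in zip(weights, window))
        s_t = g(field)
        sequence.append(s_t)
        # Shift the window: the new value becomes s_{t-1} for the next step.
        window = [s_t] + window[:-1]
    return sequence


rng = random.Random(0)  # fixed seed, used only for reproducibility
N = 20
beta = 2.0  # hypothetical gain value
teacher = [rng.gauss(0.0, 1.0) for _ in range(N)]    # assumed weight distribution
start = [rng.uniform(-1.0, 1.0) for _ in range(N)]   # random initial conditions

full = generate_sequence(teacher, start, 20 * N, lambda x: math.tanh(beta * x))
transient = 10 * N  # assumed cutoff; the text only states that the transient is O(N)
stationary = full[transient:]
```

Only `stationary` would then be used to build training examples.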

The training error is calculated from the absolute value of the deviation between the sequence $s_t$ and the
corresponding output $o_t$ of the student:

$$\epsilon = \frac{1}{T} \sum_{t=t_0+1}^{t_0+T} |s_t - o_t| \qquad (2)$$

This is the average error of a one-step prediction of the student on the time series. Perfect training leads to zero
error $\epsilon$, meaning that each number of the sequence is correctly reproduced: $s_t = o_t$.
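As a minimal sketch, the one-step training error can be computed as follows. The function name `training_error` and the argument layout are assumptions made for this illustration; the student always receives the true window $(s_{t-1}, \dots, s_{t-N})$ as input.

```python
def training_error(student, sequence, t0, T, g):
    """Average |s_t - o_t| over t = t0+1 .. t0+T (one-step prediction)."""
    n = len(student)
    total = 0.0
    for t in range(t0 + 1, t0 + T + 1):
        # Input window (s_{t-1}, ..., s_{t-n}) taken from the true sequence.
        window = [sequence[t - j] for j in range(1, n + 1)]
        o_t = g(sum(w * s for w, s in zip(student, window)))
        total += abs(sequence[t] - o_t)
    return total / T


# Toy usage: the sequence s_t = 0.5 * s_{t-1} is reproduced exactly by the
# weights (0.5, 0) with a linear transfer function.
toy = [1.0, 0.5, 0.25, 0.125, 0.0625]
print(training_error([0.5, 0.0], toy, 1, 2, lambda x: x))
```

The caller has to make sure that `t0 + 1 - len(student) >= 0`, so that every window lies inside the sequence.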

The student's knowledge about the unknown parameters is measured by the overlap $R$ between the weight vectors
of the teacher and the student:

$$R = \frac{\mathbf{w}^T \cdot \mathbf{w}^S}{|\mathbf{w}^T| \, |\mathbf{w}^S|} \qquad (3)$$

If the transfer function is continuous, it is also important that the two vectors coincide in their length,
$Q^S = Q^T$ with $Q = |\mathbf{w}|$.
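A small sketch of these two observables, assuming that the overlap is the normalized scalar product of the two weight vectors (so that $R = 1$ means parallel vectors) and that $Q$ is the Euclidean length. The function names are chosen here for illustration.

```python
import math


def length(w):
    """Euclidean length Q = |w| of a weight vector."""
    return math.sqrt(sum(x * x for x in w))


def overlap(w_teacher, w_student):
    """Normalized overlap R between teacher and student weights."""
    dot = sum(a * b for a, b in zip(w_teacher, w_student))
    return dot / (length(w_teacher) * length(w_student))


# Perpendicular vectors give R = 0; rescaling a vector changes Q but not R.
print(overlap([1.0, 0.0], [0.0, 1.0]), length([3.0, 4.0]))
```

Because $R$ is insensitive to the length of the vectors, $Q$ has to be monitored separately for continuous transfer functions.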

First we discuss the Boolean perceptron, $g(x) = \mathrm{sign}(x)$, of size $N$ which has generated a periodic bit
sequence [refs.]. The teacher perceptron has random weights with zero bias, and the cycle is related to one component
of the power spectrum of the weights. The student network is trained using the perceptron learning rule:

$$\Delta w_i^S = \frac{1}{N} \, s_t \, s_{t-i} \quad \text{if} \quad s_t \sum_{j=1}^{N} w_j^S s_{t-j} < 0; \qquad \Delta w_i^S = 0 \quad \text{else.} \qquad (4)$$



For this algorithm there exists a mathematical theorem [ref.]: if the set of examples can be generated by some
perceptron, then this algorithm stops, i.e. it finds one out of possibly many solutions. Since we consider examples
from a bit sequence generated by a perceptron, this algorithm is guaranteed to learn the sequence perfectly.
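The training procedure can be illustrated with a short sketch. The helper names, the seed, the network size, the cutoff of $10N$ steps and the sweep budget `max_sweeps` are assumptions made for this example; the update is applied only when $s_t \sum_j w_j^S s_{t-j} < 0$, and the student starts from random weights.

```python
import random


def sign(x):
    return 1 if x >= 0 else -1


def make_patterns(bits, n):
    """Return (window, target) pairs with window = (s_{t-1}, ..., s_{t-n})."""
    return [([bits[t - j] for j in range(1, n + 1)], bits[t])
            for t in range(n, len(bits))]


def train_perceptron(patterns, student, max_sweeps=2000):
    """Apply the perceptron rule until a full sweep needs no update."""
    n = len(student)
    w = list(student)
    for _ in range(max_sweeps):
        updated = False
        for window, target in patterns:
            field = sum(wi * si for wi, si in zip(w, window))
            if target * field < 0:  # wrong sign: move w towards target * window
                w = [wi + target * si / n for wi, si in zip(w, window)]
                updated = True
        if not updated:
            return w, True
    return w, False  # sweep budget exhausted (the budget is an assumption)


rng = random.Random(1)
N = 10
teacher = [rng.gauss(0.0, 1.0) for _ in range(N)]
bits = [rng.choice([-1, 1]) for _ in range(N)]
for t in range(N, 30 * N):
    bits.append(sign(sum(teacher[j - 1] * bits[t - j] for j in range(1, N + 1))))

patterns = make_patterns(bits[10 * N:], N)  # drop an assumed transient
student0 = [rng.gauss(0.0, 1.0) for _ in range(N)]
student, converged = train_perceptron(patterns, student0)
```

When `converged` is true, the student reproduces every bit of the training sequence, which says nothing yet about how close `student` is to `teacher`.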

The network is trained on the cycle until the training error is zero. Hence the student network can predict the
stationary sequence perfectly. It turns out that the overlap between student and teacher remains small; in fact it is
zero for infinitely large networks, $N \to \infty$. Although the network predicts the sequence perfectly, it does not
gain much information on the parameters of the network which has generated this sequence.

FIG. 1: Final overlap $R$ between student and teacher network after training, as a function of the size $N$ of the
network. The standard error bars result from $M = 100$ individual runs. A linear fit of $R$ vs. $N^{-1/2}$ supports
the statement that $R \to 0$ for $N \to \infty$.



This situation seems to be different in the case of a continuous perceptron. Inverting Eq. (1) for a monotonic
transfer function $g(x)$ gives $N$ linear equations for $N$ unknowns $w_j^T$. If all patterns are linearly independent,
then batch training, using $N$ windows, leads to perfect learning.
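Written out, and assuming that the sequence obeys $s_t = g(\sum_j w_j^T s_{t-j})$ with an invertible $g$, the linear system could take the following form. The choice of $N$ consecutive windows starting at $t_0 + 1$ and the symbols $y_t$ and $X$ are introduced here only for illustration; the superscript $T$ denotes the teacher, not a transpose.

```latex
g^{-1}(s_t) = \sum_{j=1}^{N} w_j^{T} \, s_{t-j},
\qquad t = t_0 + 1, \dots, t_0 + N ,
\\[4pt]
\mathbf{y} = X \, \mathbf{w}^{T},
\qquad y_t = g^{-1}(s_t), \quad X_{tj} = s_{t-j} .
```

The system has a unique solution only if the $N \times N$ matrix $X$ of input windows is non-singular, which is the linear independence condition stated above.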

A network with transfer function $g(x) = \tanh(\beta x)$ generates a quasiperiodic time series if the parameter
$\beta$ is larger than a critical value $\beta_c$ [ref.]. The form of the sequence is characterized by an attractor of
dimension one, and analytically, in leading order, it is given by

$$s_t = \tanh\big(A \cos(2\pi q t / N)\big) \qquad (5)$$

with some gain $A(\beta)$, which is non-zero above the bifurcation point $\beta_c$. Note that in the typical sequence
there is only a contribution of one non-integer wavenumber $q$, which is related to one dominant Fourier component of
the couplings $\mathbf{w}^T$; see [ref.] for details.

FIG. 2: Return map for a quasiperiodic (left) and chaotic (right) time series used for training a perceptron as
described in the text.



When trying to find the couplings $\mathbf{w}^T$ by inverting the corresponding set of equations, it turns out that
even professional computer routines often fail to perform the required matrix inversion: the patterns are almost
linearly dependent. Some explanation for this can be found from Eq. (5). For small $A$, the tanh in Eq. (5) can be
approximated by its argument, and one can easily show that $s_{t+2} = -s_t + 2\cos(2\pi q/N)\, s_{t+1}$. Therefore,
any window of the sequence can be written as a linear combination of two basis vectors. If we expand the tanh in
Eq. (5) up to the $p$-th term, one can show that the form of $s_{t+m}$ is given by
$s_{t+m}(p) = \sum_{k=0} B_{2k+1} \big(\cos(2\pi q (2k+1)/N) - \sin(2\pi q (2k+1)/N)\big)$, since
$\cos(x)^{2p+1} = \sum_k C_{2k+1} \cos((2k+1)x)$, where $C_k$ and $B_k$ are constants. On the one hand, as long as $p$
is less than the window size $N$, the inputs are linearly dependent and Eq. (5) cannot be inverted. On the other hand,
the power expansion of the tanh indicates that $B_p$ drops exponentially with $p$. Thus, the linear dependence of the
$N$-dimensional inputs is lifted only by the $p = N + 1$ term in the expansion, which decreases exponentially as $N$
increases. This is the source of the ill-conditioned problem of inverting Eq. (5).

Hence, in particular for large dimensions $N$, batch learning does not work well for quasiperiodic time series
generated by a teacher perceptron.
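The near linear dependence can be checked numerically with a short sketch. The values of `N`, `q` and `A` below are hypothetical and serve only as an illustration: for the linearized sequence $s_t = A\cos(2\pi q t/N)$ the two-term recurrence holds up to rounding, and for the full tanh form it is violated only at order $A^3$.

```python
import math

N = 50
q = 3.7    # hypothetical non-integer wavenumber
A = 0.01   # hypothetical small gain
omega = 2.0 * math.pi * q / N
steps = 3 * N


def max_residual(seq):
    """Largest violation of s_{t+2} = -s_t + 2 cos(omega) s_{t+1}."""
    return max(abs(seq[t + 2] + seq[t] - 2.0 * math.cos(omega) * seq[t + 1])
               for t in range(len(seq) - 2))


linear = [A * math.cos(omega * t) for t in range(steps)]
full = [math.tanh(A * math.cos(omega * t)) for t in range(steps)]

residual_linear = max_residual(linear)  # zero up to floating-point rounding
residual_full = max_residual(full)      # small correction from the cubic term
print(residual_linear, residual_full)
```

Every window of such a sequence lies very close to a two-dimensional subspace, which is why a matrix built from $N$ windows is so badly conditioned.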

How does this scenario show up in an on-line training algorithm for a continuous perceptron? If a quasiperiodic
sequence is learned step by step using gradient descent to update the weights, without iterating previous steps,

$$\Delta w_i^S \propto (s_t - g(h)) \, g'(h) \, s_{t-i} \quad \text{with} \quad h = \sum_{j=1}^{N} w_j^S \, s_{t-j} \qquad (6)$$

we find two time scales (time = number of training steps): (i) a fast one, increasing the overlap between teacher and
student to a value which is still far away from perfect agreement, $R = 1$ and $Q^S = Q^T$; during this phase, the
training error goes down to nearly zero; (ii) a slow one, further increasing the overlap and still decreasing the
training error.

Since the second time scale is usually several orders of magnitude larger than the first one, we could not observe
$R = 1$ within our numerical simulations. Although there is a mathematical theorem on stochastic optimization which
seems to guarantee convergence to zero training error [ref.], which implies full overlap $R = 1$ with $Q^S = Q^T$,
our on-line algorithm cannot gain much information about the teacher network, at least within practical times.
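A single on-line update can be sketched as below for $g(x) = \tanh(\beta x)$. The learning rate `eta`, its scaling with $1/N$ and the function name `online_step` are assumptions made for this illustration and are not taken from the text; the update is plain gradient descent on the squared one-step error of the current example.

```python
import math


def online_step(w, window, target, beta, eta):
    """One gradient-descent update on the squared error of a single example."""
    n = len(w)
    h = sum(wi * si for wi, si in zip(w, window))
    out = math.tanh(beta * h)           # g(h) with g(x) = tanh(beta * x)
    slope = beta * (1.0 - out * out)    # derivative g'(h)
    delta = (eta / n) * (target - out) * slope  # assumed eta / N prefactor
    return [wi + delta * si for wi, si in zip(w, window)]


# Toy usage: one step moves the student's output towards the target value.
w0 = [0.2, -0.1, 0.4]
window = [0.5, -0.3, 0.8]   # plays the role of (s_{t-1}, s_{t-2}, s_{t-3})
w1 = online_step(w0, window, 0.9, 1.0, 0.1)
```

Each example is used once and then discarded, which is what "without iterating previous steps" refers to.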

This is completely different for a chaotic time series generated by a corresponding teacher network with
$g(x) = \sin(\beta x)$ [ref.]. It turns out that learning the chaotic series works like learning random examples: after
a number of training steps of the order of $N$, the overlap $R$ relaxes exponentially fast to perfect agreement between
teacher and student, $R = 1$. The same behavior can be observed for the length $Q$ of the student, which approaches
the length of the teacher exponentially fast.

Here are some details of the numerical calculations: our simulations were performed with the same (random)
teacher weights for the quasiperiodic and the chaotic case. Furthermore, the random initializations of the student
networks were identical. The settings differ only in the choice of the transfer function $g(x)$. Return maps for the
two sequences are shown in Fig. 2.

FIG. 3: Overlap $R$ as a function of the fraction $\alpha = t/N$ of training examples. The upper curve shows the
learning dynamics for the chaotic case, the lower one shows the two time scales for the training on the quasiperiodic
series. Both settings start with the same initial overlap ($R_0 \approx -0.16$). At $\alpha \approx 0.5$ the dynamics
of the quasiperiodic case enters the part with slow progress.



Starting with the same initial overlap, the students were trained according to Eq. (6) until they achieved a certain
training error ($\epsilon = 0.008$). In both cases this took about $25N$ learning steps; the network dimension was
$N = 50$. After the training process, however, the students ended up with completely different weight vectors. In the
case of the chaotic sequence, the student's weights came close to those of the teacher ($R \to 1$, $Q \to Q^T$). In
contrast, the student of the quasiperiodic sequence did not obtain much information about the teacher, and its
weights remained nearly perpendicular to those of the teacher ($R \approx 0$). The time evolution of the respective
overlaps during training is shown in Fig. 3.

One important question remains: How well can the student predict the time series? In order to evaluate the
training success, we have defined a one-step error in Eq. (2). Now we are interested in the long-term prediction of
the students. Therefore, the student perceptrons have to act as sequence generators themselves, using their own output
to complete the next input window. Starting from a window of the teacher's sequence, i.e.
$(o_t, o_{t-1}, \dots, o_{t-N+1}) = (s_t, s_{t-1}, \dots, s_{t-N+1})$, the student's prediction $\tau$ steps ahead is
given by iterating Eq. (1) up to

$$o_{t+\tau} = g\Big(\sum_{j=1}^{N} w_j^S \, o_{t+\tau-j}\Big) \qquad (7)$$

The prediction error $\epsilon(\tau)$ is the average absolute deviation of this value from the respective item of the
teacher's sequence,

$$\epsilon(\tau) = \frac{1}{T} \sum_{t=t_0+1}^{t_0+T} |s_{t+\tau} - o_{t+\tau}| \qquad (8)$$

Note that the average is performed by changing the initial time window. Again, $t_0$ is used to indicate any time step
of the stationary part of the sequence. To calculate $\epsilon(\tau)$ in the simulations, we have chosen $T = N = 50$.
The result is shown in Fig. 4.
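The long-term prediction error can be sketched as follows. The function names and the argument layout are assumptions for this illustration: the student starts from a true window of the teacher's sequence and then feeds its own outputs back for a given number of steps, and the result is averaged over $T$ different starting windows.

```python
import math


def step(weights, window, g):
    """One perceptron output for window = (x_{t-1}, ..., x_{t-N})."""
    return g(sum(w * x for w, x in zip(weights, window)))


def prediction_error(student, sequence, t0, T, tau, g):
    """Average |s_{t+tau} - o_{t+tau}| over start times t = t0+1 .. t0+T."""
    n = len(student)
    total = 0.0
    for t in range(t0 + 1, t0 + T + 1):
        # Start from the teacher's window (s_t, s_{t-1}, ..., s_{t-N+1}).
        window = [sequence[t - j] for j in range(n)]
        for _ in range(tau):
            o = step(student, window, g)
            window = [o] + window[:-1]  # feed the student's own output back
        total += abs(sequence[t + tau] - window[0])
    return total / T


# Toy usage with hypothetical weights: a student identical to the teacher
# reproduces the sequence exactly, so its prediction error vanishes.
g = lambda x: math.sin(3.0 * x)
teacher = [0.3, -0.7, 0.5]
seq = [0.1, 0.2, -0.4]
for t in range(3, 60):
    seq.append(step(teacher, [seq[t - 1], seq[t - 2], seq[t - 3]], g))
print(prediction_error(teacher, seq, 10, 5, 8, g))
```

The sequence must be long enough that the index `t0 + T + tau` still lies inside it.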

The graph shows the prediction error as a function of the time interval over which the student makes predictions. 
Both curves coincide in the first value, which is equal to the training error at which learning was stopped. 

The student network which has been trained on the quasiperiodic sequence can predict it very well. The error
increases linearly with the size of the interval; even predicting $25N$ steps ahead yields an error of less than 5% of
the total possible range. On the other hand, the student trained on the chaotic sequence cannot make long-term
predictions. The prediction error increases exponentially with time until it is of the order of random guessing.

FIG. 4: Prediction error as a function of time steps ahead (measured in multiples of $N$: $\alpha = t/N$), for the
quasiperiodic (lower) and the chaotic (upper) series.



Of course, if the student reproduced the series perfectly, it would also predict it without errors. But since we
stop our algorithm when the training error is close but not identical to zero, we reach two different states. For the
quasiperiodic sequence, the weight vector of the student recovers the main Fourier component of the teacher, which
reproduces the sequence reasonably well. There remains a large space of weight vectors which can generate the same
sequence. For the chaotic sequence, however, all the weights of the student come extremely close to those of the
teacher; but due to the sensitivity to model parameters, any long-term prediction of the sequence is impossible.

All of our results stem from numerical simulations. We find that the quantitative details of our results strongly
depend on the parameters of our model. Hence we did not succeed in deriving quantitative results about the scaling of
learning times with system size $N$ or the Lyapunov exponent as a function of the fractal dimension of the chaotic
time series.

In summary, we obtain the following results:

(i) A network trained on a quasiperiodic sequence does not obtain much information about the teacher network
which generated the sequence. But the network can predict this sequence over many (of the order of $N$) steps ahead.

(ii) A network trained on a chaotic sequence, however, obtains almost complete knowledge about the teacher network.
But due to the chaotic nature of the sequence, this network cannot make reasonable predictions.